I first checked whether the website can be scraped; the check returned the following:
<polite session>
User-agent: polite R package
robots.txt: 149 rules are defined for 2 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
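The call that produced this output is not shown. A minimal sketch of such a check with the polite package could look like the following. The helper name `check_scrapable` is hypothetical and the URL is a placeholder, since the real target address is not given here.

```r
# Hypothetical helper: wraps polite::bow(), which reads the host's robots.txt
# and returns a session object describing what may be scraped.
check_scrapable <- function(url) {
  polite::bow(url)
}

# Usage (needs network access; the URL is a placeholder, not the real target):
# session <- check_scrapable("https://www.example.com/")
# session  # printing the session shows the user-agent, the robots.txt rules,
#          # the crawl delay, and whether the path is scrapable
```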
Acquiring the data
The function below scrapes reviews from Amazon. For each review I collect the review title, the review text and the number of stars. I have collected the reviews for the following books:
A Game of Thrones: A Song of Ice and Fire, Book 1
A Clash of Kings: A Song of Ice and Fire, Book 2
A Storm of Swords: A Song of Ice and Fire, Book 3
A Feast for Crows: A Song of Ice and Fire, Book 4
A Dance with Dragons: A Song of Ice and Fire, Book 5
Twilight: The Twilight Saga, Book 1
New Moon: The Twilight Saga, Book 2
Eclipse: The Twilight Saga, Book 3
Breaking Dawn: The Twilight Saga, Book 4
The Hunger Games
Catching Fire: The Hunger Games
Mockingjay: The Hunger Games, Book 3
scrape_amazon <- function(ASIN, page_num) {
  url_reviews <- paste0("", ASIN, "/?pageNumber=", page_num)
  doc <- read_html(url_reviews) # Assign results to `doc`

  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a tibble
  tibble(review_title, review_text, review_star, page = page_num, ASIN) %>%
    return()
}
Using the above function, I scraped an equal number of reviews for each series so that they can be compared. I used a for loop with a sleep time of 2 seconds between requests to avoid bot detection, then saved the whole data set in CSV format.
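The loop itself is not shown. As a rough sketch of the idea, the driver below calls a scraping function for every ASIN and page, sleeps between requests, and writes the combined result to CSV. The names `scrape_all` and `fake_scraper`, the ASIN values and the output path are all assumptions made for illustration; the stand-in scraper only exists so the sketch runs offline.

```r
# Hypothetical driver: calls `scrape_fn(ASIN, page_num)` for every ASIN/page
# pair, pauses between requests, and stacks the results into one data frame.
scrape_all <- function(asins, pages, scrape_fn, delay = 2) {
  results <- list()
  for (asin in asins) {
    for (page in pages) {
      results[[length(results) + 1]] <- scrape_fn(asin, page)
      Sys.sleep(delay)  # wait between requests
    }
  }
  do.call(rbind, results)
}

# Stand-in scraper so the sketch runs offline; pass scrape_amazon for real use.
fake_scraper <- function(ASIN, page_num) {
  data.frame(review_title = "title", review_text = "text",
             review_star = "5.0 out of 5 stars",
             page = page_num, ASIN = ASIN, stringsAsFactors = FALSE)
}

# Placeholder ASINs, not real product identifiers.
reviews <- scrape_all(c("ASIN_A", "ASIN_B"), 1:2, fake_scraper, delay = 0)

# Save the combined reviews as CSV (temporary path used here for illustration).
csv_path <- file.path(tempdir(), "amazonreview.csv")
write.csv(reviews, csv_path, row.names = FALSE)
```

Passing the scraper in as an argument keeps the loop independent of the page structure, so the same driver can be reused if the selectors change.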
Reading the data
reviews <-read_csv("amazonreview.csv")
Data Preprocessing
I cleaned the review text using the stringr library: URLs, mentions, punctuation, digits, UTF symbols, new line characters, single-letter words and Pascal case words are removed, and & is replaced with "and". Stopwords are then removed; the stopword collection is taken from stopwords-iso and SMART. Stemming is not used here because the exact meaning of each word matters for the analysis. Finally, I categorized the data by book title and series title. The analysis in this project is based on the series categories, so that sentiments and topics can be compared across series.
clean_text <- function(text) {
  # Remove url
  str_remove_all(text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
    # Remove mentions
    str_remove_all("@[[:alnum:]_]*") %>%
    # Replace "&" character reference with "and"
    str_replace_all("&", "and") %>%
    # Remove punctuation
    str_remove_all("[[:punct:]]") %>%
    # remove digits
    str_remove_all("[[:digit:]]") %>%
    # Replace any newline characters with a space
    str_replace_all("\\\n|\\\r", " ") %>%
    # remove strings like "<U+0001F9F5>"
    str_remove_all("<.*?>") %>%
    # Make everything lowercase
    str_to_lower() %>%
    # Remove any trailing white space around the text and inside a string
    str_squish()
}
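The stopword step mentioned above is not shown in code. One possible sketch, assuming the `stopwords` R package is used to fetch the English stopwords-iso and SMART lists, and using a hypothetical helper `remove_stopwords` that also drops single-character words:

```r
library(stopwords)

# Combine the two stopword lists named in the text (English is an assumption).
stop_list <- unique(c(
  stopwords::stopwords("en", source = "stopwords-iso"),
  stopwords::stopwords("en", source = "smart")
))

# Hypothetical helper: drops stopwords and single-character words from
# already-cleaned, lower-case text.
remove_stopwords <- function(text, stop_list) {
  vapply(strsplit(text, " ", fixed = TRUE), function(words) {
    keep <- !(words %in% stop_list) & nchar(words) > 1
    paste(words[keep], collapse = " ")
  }, character(1))
}

remove_stopwords("the dragons in this book are amazing", stop_list)
```

Because no stemming is applied, the surviving words keep their original form, which matches the choice described above.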
Finding the top features from the above feature co-occurrence matrix and plotting the network plot.
# pull the top features
top_features <- names(topfeatures(text_fcm, 50))
# retain only those top features as part of our matrix
even_text_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
# check dimensions
dim(even_text_fcm)
[1] 50 50
# compute size weight for vertices in network
size <- log(colSums(even_text_fcm))
# create plot
textplot_network(even_text_fcm, vertex_size = size / max(size) * 2)
I can observe that "book" has the highest count, followed by interesting words like great, enjoyed, amazing, good and love, which express people's feelings and can help in our sentiment analysis.
Further study
Next, I will transform and categorize the data, produce some analysis plots and, if possible, do some sentiment analysis.
Source Code
## Introduction
In this blog post I plan to scrape reviews of different products on Amazon and pre-process the data.

## Loading the libraries

## Acquiring the data

I have collected the reviews for the following books:

- A Game of Thrones: A Song of Ice and Fire, Book 1
- A Clash of Kings: A Song of Ice and Fire, Book 2
- A Storm of Swords: A Song of Ice and Fire, Book 3
- A Feast for Crows: A Song of Ice and Fire, Book 4
- A Dance with Dragons: A Song of Ice and Fire, Book 5
- Twilight: The Twilight Saga, Book 1
- New Moon: The Twilight Saga, Book 2
- Eclipse: The Twilight Saga, Book 3
- Breaking Dawn: The Twilight Saga, Book 4
- The Hunger Games
- Catching Fire: The Hunger Games
- Mockingjay: The Hunger Games, Book 3

```{r}
scrape_amazon <- function(ASIN, page_num) {
  url_reviews <- paste0("", ASIN, "/?pageNumber=", page_num)
  doc <- read_html(url_reviews) # Assign results to `doc`

  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a tibble
  tibble(review_title, review_text, review_star, page = page_num, ASIN) %>%
    return()
}
```

Using the above function, I scraped an equal number of reviews for each series so that they can be compared. I used a for loop with a sleep time of 2 seconds between requests to avoid bot detection, then saved the whole data set in CSV format.

## Reading the data

## Data Preprocessing

Stopwords are removed. The stopword collection is taken from `stopwords-iso` and `SMART`. Stemming is not used here because the exact meaning of each word matters for the analysis. The data is then categorized by book title and series title. The analysis in this project is based on the series categories, so that sentiments and topics can be compared across series.

## Tidying the cleaned data
Dropping the NA values in the cleaned text.

### Corpus of the data

The total number of tokens in the text is `1682431`.

Finding the frequency and rank of each word in the data
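As an illustration of this step, the base R sketch below counts each word and assigns ranks by descending frequency. The vector `clean_reviews` is toy data standing in for the cleaned review text, and the column names are assumptions.

```r
# Toy stand-in for the cleaned review text.
clean_reviews <- c("great book love characters", "book good", "love book")

# Split into words and count occurrences, most frequent first.
words <- unlist(strsplit(clean_reviews, " ", fixed = TRUE))
counts <- sort(table(words), decreasing = TRUE)

word_freq <- data.frame(
  word = names(counts),
  frequency = as.integer(counts),
  stringsAsFactors = FALSE
)

# Rank 1 is the most frequent word; tied words share the lowest rank.
word_freq$rank <- rank(-word_freq$frequency, ties.method = "min")

head(word_freq)
```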

Finding the top features from the above feature co-occurrence matrix and plotting the network plot.
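The construction of the feature co-occurrence matrix is not shown here. A small sketch of how one could be built and trimmed with quanteda, using toy text in place of the cleaned reviews and assuming document-level co-occurrence (the original settings are unknown):

```r
library(quanteda)

# Toy stand-in for the cleaned review text.
clean_reviews <- c("great book love characters",
                   "book good story",
                   "love book great story")

toks <- tokens(clean_reviews)

# Count how often two features appear in the same document (an assumption;
# context = "window" with a sliding window is the other option).
text_fcm <- fcm(toks, context = "document")

# Keep only the top features, as in the 50-feature selection described above.
top_features <- names(topfeatures(text_fcm, 3))
small_fcm <- fcm_select(text_fcm, pattern = top_features, selection = "keep")
dim(small_fcm)
```

The trimmed matrix is square, with one row and one column per retained feature, which is the shape a network plot of co-occurrences expects.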

## Wordcloud of the data

I can observe that "book" has the highest count, followed by interesting words like great, enjoyed, amazing, good and love, which express people's feelings and can help in our sentiment analysis.

## Further study
Next, I will transform and categorize the data, produce some analysis plots and, if possible, do some sentiment analysis.